oonTnNS OF THE ZINC-FINGER



BACKGROUND OF THE INVENTION 

, .„„dm B potential nudacacid-binding prcans 

Asupeflarmlyofeutaryom.*-. - -• Proims tot have these 

fi „„„ (ZF ) demons of the CysyH^ (C 2 H,) class 

ch> rac,ensuc s.ctura, l . rt ervstaiiographtc -r— - 

Sequence colons, mutauona, a * ^ of B , om DNA 

^ that each «*« 1 — a DNA tnp* These -spec* 

^ contacts »th some or ^ . soedfic positions tn * ► 

interactions are mediated through amino acid (AA)

W 1 5 " 101 °^ Pr0Uta '""I tlm , ,00 ZF motifs have been identified, the 
^oughtheAAseouencesofmorethanl. b , 5 mfoml .ion on 

DN A E ontac, regions conce^s W ^ ^ ^ ^ fot „ lK 

^ne-nch sues t ».W ^^J, S „ ES taveb een made ,»..«]. 

re ,a„ng ZF sequences to preferred DNA b. » ^ ^ ^ du£ l0 lhe 



act that neither computer iuu » tp-DNA contact region. 

=nough informaoon on the overa* — ^ lhe ^ OT d,„ons in the 
Using Phystc* atomicmo^ mode s to charter* ^ ^ 

• c fnr different ZF-DNA interactions, an odjcu 
specific contact posmons for «*» fer ^ rKOgrJt ,o„ for 

» the present mvennon - » « ^ J, had been reached, the «* of the 



nM A-mnamE poivpepuao . • 
°"* £dea '" Tle'of — e ,„ the fie*. — character „ u» 
represents a major advance oi m. 
disclosures of Rebar, et. aj. and Beerii, et. a) f 15 |g] n

* seleco, usms the phage J ' - - concerned 

«n g species. 0 „ * other hMd , he ' ' SP ; C ' fiC " **» * - DNA- 

DNA-bmdin, nro.ein, f „. .... .,. [ ? " d,SCl ° SUre is c °°<*™d w„h ,he de„„ „ f 

" w * ^ £ ive n una sequence. 

SUMMARY OF THE INVENTION 



The invention is directed m th* ,j 

— * c 2 „ : 2 , nc , nger ; o f ™ A ,,„ di „ s proKms 

» S™ula descnbing ,he class cf DBfUT"^ ^ bM " de, ™«d. and 

tave been d e ttrTmned for lhe ^ * "» *™ A L, s „, 

".ve s, g „ncan t and use4ll DNA . bmdmg ^ j ' °°' <™>S „p,„„ b , ndmg do 

BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 depicts the alignment of [illegible] various [illegible] DBP's.

Figure 2 is a schematic representation of the interaction between a target DNA triplet
and a single ZF domain.

Figure 3 is a schematic representation of the interaction between a target DNA string
of 9 bases and a three-domain DBP.

Figure 4 is a block flow diagram of the computer [illegible] by which the [illegible] DBP
design process is implemented.

Figure 5 is a block flow diagram wherein the computer program block (2) of Figure 4
is further broken down.

Figure 6 is a block flow diagram wherein the Process Genome into Blocking Fragment
Files block (2) of Figure 5 is further broken down.

Figure 7 is a block flow diagram wherein the Design DBP's for a Genome block (3) of
Figure 3 is further broken down.

Figure 8 is a block flow diagram wherein block (22) of Figure 7 is further broken
down.

Figure 9 is a block flow diagram wherein block (24) of Figure 7 is further broken
down.

Figure 10 shows the distribution of binding strengths of acceptable 9-finger DBP's
across the yeast genes analyzed.

Figure 11 shows the values of the binding energies of the acceptable 9-finger DBP's
found for the yeast genes analyzed.

Figure 12 shows the distribution of DBP subsite (spurious) binding energies across the
yeast genes analyzed.

Figure 13 shows, in nonlogarithmic fashion, the distribution depicted in Figure 12.

Figure 14 shows the ratios of binding energy to subsite (spurious) binding energy,
across the yeast genes analyzed, for the acceptable 9-finger DBP's.

Figure 15 shows the values of the spurious binding energies for each of the
27-base-pair (bp) frames of the 300-bp promoter region of yeast gene YAR073.

Figure 16 shows the ratios of binding energy to subsite (spurious) binding energy for
each of the 27-base-pair (bp) frames of the 300-bp promoter region of yeast gene YAR073.

Figure 17 shows the distribution of sizes of acceptable DBP's across the C. elegans
genes analyzed.

Figure 18 shows the ratios of binding energy to subsite (spurious) binding energy,
across the C. elegans genes analyzed, for the acceptable DBP's.
DETAILED DESCRIPTION OF THE INVENTION

The general rules governing the binding of C2H2 ZF motifs to DNA were developed
by using a combination of the database analysis of the homologies between 1,851 possible ZF
domains and physical molecular modeling of the interaction of a DBP model with a DNA
model containing all 64 possible base-pair triplets. The DBP model approximates the size and
shape of a half-gallon jug of milk. The DNA model is approximately four feet long and one
foot in diameter. The axis of the DNA model is horizontal and can be rotated to observe each
of the 64 base-pair triplets. By moving the DBP model in and out with respect to the DNA
model one can observe the amino acid and nucleic acid contacts.
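
The figure of 64 triplets follows from four bases taken three at a time (4 x 4 x 4). As a minimal sketch, the full set can be enumerated as shown here; the function name `all_triplets` is illustrative and is not part of the original disclosure.

```python
from itertools import product

BASES = "ACGT"  # the four DNA bases


def all_triplets():
    """Return every 3-base combination of A, C, G and T in lexicographic order."""
    return ["".join(combo) for combo in product(BASES, repeat=3)]


triplets = all_triplets()
print(len(triplets))  # 64, i.e. 4 ** 3
```
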

Although the following description details the scientific precedents of this invention,
the completeness of the rule set governing the DBP-DNA interaction could have only been
obtained by the continual, derivative interplay of database analysis and physical modeling
during the invention period. Observations as to the conservation and variability of amino acids
at various places in the ZF motif were embodied, first, by constructing a physical model of the
ZF motif and, then, by physically modeling the interaction of a specific DBP with a designated
DNA bp triplet. The physical modeling indicated patterns of amino acid and nucleic acid
interaction which led to further analysis of the database. Iterations of this interplay between
database analysis and physical modeling enabled conceptual refinement and expansion of the
nature of contact patterns. As these patterns emerged, systematic variation of the amino acids
in the ZF motif was undertaken for each of the 64 base-pair triplets. The physical modeling of
the interaction between a DBP and DNA was efficient because alternative amino acids could
be easily introduced into the ZF motif and the resulting protein physically modeled against the
DNA. Hydrogen bonding, and water and hydrophobic contacts could then be modeled, clearly
determined and counted very quickly. From this physical modeling a general set of rules was
developed which incorporates criteria for the design of DBP's that specifically interact with
DNA.

The utility of ZF sequence analysis and alignment is illustrated by Figure 1. The
TFIIIA protein is widely used as a model for ZF proteins both in terms of physical
measurement and modification and theoretical data analysis. For each of the nine zinc-finger
domains the TFIIIA amino acid sequence in this figure has been aligned so that the
zinc-binding amino acids, the two cysteines (CYS) and the two histidines (HIS), are aligned in four
columns. In order to achieve this alignment dashes must be inserted into the sequence at
various places to provide for domains which have additional amino acids. The same type of
alignment has been done for ZF protein MKR2 and the Krüppel proteins. The MKR2
sequence alignment is very compact; there is no need for any insertions, since all of its ZF
domains are of the same size. Compared to TFIIIA, MKR2 acts as a much more uniform
model for studying the interaction of the amino acids of the protein with the bases of the
specific double-stranded DNA. To arrive at the present invention, MKR2 has been used
exclusively as the sequence basis for deducing the general rules which govern DBP-DNA
interactions.

The crystallographic analysis of a complex containing three ZFs from ZF protein
Zif268 and a consensus DNA-binding site helped identify the localization of ZF-B-DNA
recognition sub-sites [7]. Because the mutagenesis and sequence investigation results are in
accordance with crystal structure data, it is reasonable to expect that the same contact regions
also participate in the interaction of other ZF-DNA complexes [5,6,8-10]. Thus, it has been
assumed that the following ZF components of a ZF protein play a key role in the anti-parallel
DNA reading process: 1) the AA immediately preceding the α-helical region of the protein; 2)
the third residue within the α-helical region, i.e., that immediately preceding constant leucine;
and, 3) the sixth residue of this region, i.e., that immediately preceding invariant histidine.

These components are indicated below as Z3, Z2 and Z1, respectively, in the
generalized ZF sequence (α-helical and β-structural regions are labeled) given in Formula 1:

Y/T X C X2-4 C G/D K X F X Z3 X X Z2 L X Z1 H X3-5 H T/S G/E X0-2 E K/R P

        β-structural region                α-helix

Formula 1



wherein X is any amino acid; X2-4 is a peptide 2 to 4 amino acids in length; X3-5 is a peptide 3
to 5 amino acids in length; X0-2 is a peptide 0 to 2 amino acids in length; and C, D, E, F, G, H,
K, L, P, R, S, T and Y designate specific amino acids according to the standard single-letter
code.
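
Because Formula 1 is a fixed pattern with a few variable-length gaps, it can be expressed as a regular expression. The sketch below assumes the formula reads Y/T X C X(2-4) C G/D K X F X Z3 X X Z2 L X Z1 H X(3-5) H T/S G/E X(0-2) E K/R P, which is a reconstruction of damaged source text; the names `FORMULA_1` and `find_contacts` and the example sequence are hypothetical and serve only as an illustration.

```python
import re

# Assumed reading of Formula 1. The named groups capture the three
# contact residues Z3, Z2 and Z1 in the order they occur in the sequence.
FORMULA_1 = re.compile(
    r"[YT].C.{2,4}C[GD]K.F."
    r"(?P<Z3>.)..(?P<Z2>.)L.(?P<Z1>.)H"
    r".{3,5}H[TS][GE].{0,2}E[KR]P"
)


def find_contacts(sequence):
    """Return (Z1, Z2, Z3) for every domain in `sequence` that fits the pattern."""
    return [(m["Z1"], m["Z2"], m["Z3"]) for m in FORMULA_1.finditer(sequence)]


# Hypothetical sequence composed to fit the pattern; not taken from a database.
example = "YKCPECGKAFSRSDELTRHQRIHTGEKP"
print(find_contacts(example))  # [('R', 'E', 'R')]
```

A strict pattern of this kind would miss natural domains that deviate from the consensus, so it is better read as a restatement of the formula than as a general-purpose domain finder.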
Keeping in mind the above formula, one can envision the formation of antiparallel
trinucleotide-peptide complexes with three (first, second and third) contact positions as
follows:

5'-N1-N2-N3-3'

COOH-Z1-Z2-Z3-NH2



The crystaJlographic investigation of the 7if?*s nv a 

the wav t he r . . Zif268-DNA complex also gave indications of 

h w th comact ^ imeract pavietjch ^ ^ conduded 

- --en bonds (H-bonds) w,th the bases of the co din g DNA strand in the ma.or 

I ThT" rCS,dUeS m ^ COmaCt P ° SitIOnS (S6e positions 

ma,e H bonds w lth the N 7 ,d Co atoms of the guanJ ne. Three ar g i ni ne resL 

—due in t.s position forms lateral H-bond, s,t b ridg e interactions w,th c^ te 
* u P s ot aspanic acid oc CU mn g as the second residue ,n the a-hel, The N6 atom of the 

to m - T ddIe — P ° SItl0n " - — * — ^ s an H-ond 

::,:;„; :zzt^t r - Mine res,dues n - — 

poiypeptide-DNA complexes «s confirmed by experiments of directed 
«da and pnage 434 repressors. complexed ^ M 

« can „so b e H-c,oded by , ysine , ^ ^ *> 

No do™, ,he remainrng polar AA's - rhreonrne and tyrosme _ ' ' 

bonds wi,h guanine »<= able ,o form analogous 

" H t ' hat ° W — *• « - — glgro n ,„ 

, ITr protan " DNA bindin8 ^ haw sh °™ - - 

™ ,n.erac„„„ does occur «h born gtaannc acid and aspartc acd [5,6 9,4,9, 

and Berg [ , 4, propose, an H-.on.ng fomruia for U,e M eracoo» be™*, cUe 

nrerac, on w rn cv.os.ne depends on * presence of giuranrrne or arg,„,„e u, ,he rhird 
ardelh et al. W reve al that cytosine can interact with a glutamane residue. 
This may also be true for asparagine, which has similar polar groups. Cytosine should also be
capable of making an H-bond with the hydroxyl oxygen atom in serine and threonine residues.

Thymine in the Zif268-DNA complex does not seem to participate in the recognition
process. However, the crystal structure investigations of the lambda repressor
DNA-binding-domain-DNA and engrailed homeodomain-DNA complexes, as well as ZF protein-DNA
binding assays, demonstrate that thymine can make both hydrophobic contacts with non-polar
residues (alanine, leucine, isoleucine, valine) and H-bonds with polar AA's (lysine, arginine,
glutamine) [8,11,14,17,20].

The X-ray crystallographic studies of lambda and phage 434 repressor DNA-binding
domain complexes with corresponding operator sites revealed that an adenine base forms two
H-bonds to glutamine: 1) the amide NH2 group of the glutamine side chain donates an H-bond
to the N7 atom of adenine and 2) the amide O atom accepts an H-bond from the N6 atom
[17,18]. Similar H-bonds have been found between adenine and asparagine residues in the
two homeodomain complexes [20,21]. ZF protein-DNA binding assays also indicate that, in
ZF contact positions, adenine makes strong interactions with both glutamine and asparagine
[8,11,12,14]. Considering that glutamic and aspartic acid carboxylic groups have O atoms
capable of accepting H-bonds, as do glutamine and asparagine amide O atoms, one may
suppose that adenine can form a single H-bond with both glutamic and aspartic acid. Indeed,
Letovsky and Dynan [19] have shown in a directed mutagenesis investigation that
transcription factor Sp1, containing a glutamic acid residue in the central contact position of
the ZF, binds only 3-fold more weakly to the adenine-substituted variant (-GAG-) than to the
wild-type consensus recognition site (-GCG-). In addition, Desjarlais and Berg [14] and Berg [8]
think it probable that adenine can (like guanine in the Zif268-DNA complex) make one
H-bond to a histidine residue. It is likely that not only histidine but also other polar amino acids
(arginine, lysine, tyrosine, serine and threonine) are capable of forming an H-bond to atom N7
of adenine.

A database of potential ZF protein domains, containing 1,851 entries, has been
assembled. This database was used computationally to observe the homologies between the



uniquely laeriurymg dm p<uui 
proposed that fidelity of recognition may be achieved using two H-bonds, as occurs in the
major groove when asparagine or glutamine binds to adenine, and arginine binds to guanine.
On the basis of the above-given results, it was reasonable to test, using the models
described herein, base recognition at the ZF contact positions of the following AA's:

1) guanine - R, H, K, Y, Q, N, S, T;

2) cytosine - E, D, Q, N, S, T;

3) thymine - I, L, V, A, R, H, K, Y, Q, N, S, T;

4) adenine - Q, N, E, D, H, R, K, Y, S, T.
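
The four candidate lists above lend themselves to a simple lookup table. The following is a minimal sketch that records which single-letter amino acids were listed for testing against each base; it assumes the first thymine entry reads "I" (isoleucine), and the names `CANDIDATES`, `is_candidate` and `bases_for` are illustrative only. Membership here means "listed for testing", not "confirmed to bind".

```python
# Amino acids (single-letter code) listed for testing against each base.
# Assumption: the first thymine entry is "I" (isoleucine).
CANDIDATES = {
    "G": set("RHKYQNST"),
    "C": set("EDQNST"),
    "T": set("ILVARHKYQNST"),
    "A": set("QNEDHRKYST"),
}


def is_candidate(base, amino_acid):
    """True if `amino_acid` was listed for testing against `base`."""
    return amino_acid in CANDIDATES[base]


def bases_for(amino_acid):
    """Return the bases, in alphabetical order, for which `amino_acid` was listed."""
    return sorted(base for base, aas in CANDIDATES.items() if amino_acid in aas)


print(bases_for("E"))  # ['A', 'C']
print(bases_for("I"))  # ['T']
```
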

Plastic space-filling atomic-molecular and ionic models [23,24] have been used to build
ZF-DNA complex imitations. These molecular models were chosen due to the extraordinary
firmness of their connectors, their convenient scale (1 cm = 1 Å = 0.1 nm) and their improved
theoretical parameters, which were very suitable for the modeling of macromolecules. New
modules of tetrahedral carbon atoms, with bond angles 100° and 105°, dihedral oxygen atoms
(120°), and tetrahedral phosphorus atoms (102° and 118°), maintained the exact modeling of
deoxyribose puckering and sugar-phosphate chain conformation in the B-DNA model.
Peptide bonds in the DBP models were imitated by the fixing, to each other, of special
modules of carbon atoms (bond angles 116°, 120.5° and 123.5°) and nitrogen atoms (122° and
119°). The zinc ion was represented in the model by a sphere (R = 0.85 cm) fixed
tetrahedrally to N and S atom modules of ZF histidine and cysteine residues. A long
horizontal 34-base B-form DNA model with laterally-fixed DBP models was used for docking
experiments.

In the first stage of the subject investigation, the models of Zif268 fingers 1, 2 and 3
were assembled, and the general spatial orientation of the ZF-B-DNA complex was observed.
In the second stage, the steric fitness of all 64 nucleotide triplets to the different combinations
of the above-mentioned AA's in the critical positions of the ZF-DNA complex was modeled.
A plastic molecular model of the Zif268 peptide-DNA complex was assembled on the
basis of crystallographic data [7]. After the imitating of ZF-DNA backbone contacts and
H-bonds between AA and bases in the major groove, it was confirmed that the overall
arrangement of Zif268 is antiparallel to the DNA strand. The most steady ZF-DNA
nonspecific interaction seems to be the H-bond between a phosphodiester oxygen atom and
the first invariant histidine residue fixed to the Zn ion. A conserved arginine on the second β
strand also contacts phosphodiester oxygen atoms on the primary DNA strand. However,
fingers 2 and 3 of Zif268 contact equivalent phosphates with respect to the 3-bp sub-sites,
whereas the finger-1 H-bond is shifted by one nucleotide. Another four ZF-DNA backbone
contacts made by arginine and serine residues are even more irregular in relation to the ZF
modular structure.

All 11 critical H-bonds found in the Zif268-DNA crystal complex have been observed
in the plastic models. As expected, the threonine residue in the first contact position of the
second finger was too far from thymine to make an H-bond. However, differing from the
results of crystal structure analysis, the model investigation clearly indicated the possibility of
hydrogen bonding between a glutamic acid residue and cytosine in the second contact position
of fingers 1 and 3.

It is noteworthy that, of the six guanine-AA contacts in recognition positions observed
in the Zif268-DNA crystal structure, five were made with arginine and only one with histidine.
It is even more interesting that this histidine-guanine interaction was the only one in the
central-specific position. Considering the smaller size of histidine in comparison with arginine,
it may be supposed that the middle position has steric constraints prohibiting contact between
guanine and the larger arginine residue, although, due to its capability of forming two
H-bonds, the latter pairing should be energetically favored.

To investigate the spatial conditions in different recognition positions, a B-DNA model
was built which contained, in the primary strand, 1) the triplet GGG, and 2) models of ZF
α-helical protein fragments (including the AA immediately preceding the α-helix) with a) side
groups of the first Zn-binding histidine and b) groups for critical AA triplets R1R2R3 and
R1H2R3. The models of α-helical fragments were fixed to the B-DNA model by an imitation
of an H-bond joining a phosphodiester oxygen atom with a histidine residue. Specific
base-AA contacts were then tested in these complexes. It was elucidated that only the complex
GGG-R1H2R3 contains the contact groups in positions corresponding to the distances of
critical H-bonds found in the Zif268-DNA crystal structure. The complex GGG-R1R2R3 is
sterically unfavorable; molecular modeling reveals that, although in the outer contact positions
guanine and arginine can be joined by two H-bonds, in the middle position such a pair cannot
be included due to the limited space.



■ _ rt.r^nroc tmm mianine > anti v }i ' 

:cmplex G|U 2 0 3 -RiH2K3 ( me lguuwili* ayyiwii^ ^ » - ^ 

to the C, atoms of corresponding AA's have been determined: G,N7-Ri=7A, 0,06- 

S A G -N 7-H: -5 5 A, G ? 06 H : -6 5 A, G,N7-R : .=RA and G,06-R,=7A 



atoms 



WO 99/42474 



PCT/US99/03692 



bsing the models, the investigation of B-DNA an n n u' u ■ 

the molecular h™* f CilX baSIC elucidated 

molecular basis for stenc constraints in the second ZF DMi 
t • ■ , . , <^cLona /Lt-DNA recoeniuon position 

Joirung, by a straight line, the analogous atomic * ro „n s (f " 

- - - - - - * «» , - - 

- ^ — — - - — = 

joining the C. atoms of the AA's in the first and th' a * 

:~ rr- " - - - - — - — *~ 

Anaivs,s of*. abov, 8 ive„ da,a „„ ,„ e 2F-DNA backbone co„,a M s a, we„ as 
observes derived f rom ,„ e ^ |ed ,„ ^ * " 

^uuusion that there are cons derabie 
Terences m spatial conditions between first and third ZF DMA 

th.fi ZF-DNA recognition positions In 

the nrst posmon the C a atom of the AA is distanced about 6 SA * u\ 

Lea aoout 6 5 A from the phosphodiester 

oxygen atom where the ZF protein is fixed to the DNA backbond „ 

residue n..- m th , , ■ backbone by the invariant histidine 

11 SM " " 8 ° f ZF Pm » - **dom of 

»nfo ma „ona, r ea,a„ geme „,s in ,,e first C0 „ Iaa posi , jon 

correspondmg s,de chain, can be moved 2-3A ■„„ , „ . . 

- pnma, DNA J. ^ ^ " ^ " * ^ ^ " 
- *. o.her ban, ,be «n g of te ™™ » * - 

DNA backbone seems ,o be ra,her ioose and vanable .here " ** 

v *naoie, therefore allowing relatively Iaiw 
rearrangements for the C. atom and the corresponds AA in the th' H 
lan^r ™ , Hu»uin s aa in the third contact position The 

latter contact position is favored by the fact that th. r 

fi- u heCoatominth 's position is more Hi« a m 

from the main fixation place rabout in si «- i is more distant 

hmiH- •„ Phosphodiester atom bound to the 

histidine residue), and the corresponding AA in thi< „ rt • • • 

^ ng aa in this position is not a pan of the a-heliv Th- 
most important finding is that, due to the above-describeH r 0ttheahel * The 

«*- -act site can apparently occupy very ^^T^ ^ ^ " '* 

™. — *, residue may, m cenai o coo^^ " * ^ 

rn „ , complexes, be very close to the base of the 

T tatnWy ^ ^ °"< «- * *• appearance of such a 8 l KnC a, 
IT" 0 " '\ " ^ nSh, - ha " d ' d ^ "* ° f B ^ Nakes J 

P 'ally in the second contact position, this DNA strand is capable of 
Participating m the ZF-nucleic aciH r„„ • ■ capaoie ot 

acid reckon process. In the Zif268-DNA crystal complex, 
the α-helix of each ZF domain, which is bound only to the DNA primary strand, is tipped at
about a 45° angle with respect to the plane of the base pairs [7]. In cases wherein the second
DNA strand, via critical H-bonds involving the third and second contact positions, is involved
in the reading process, the direction of the α-helix axis should be even more perpendicular to
the base pair plane.

Thus, this more detailed investigation of ZF-DNA-complex imitations, through use of
physical molecular models, shows that steric conditions in each of the three contact regions
are different. These steric conditions are reflected in the ZF-DNA recognition rules.

On the basis of information obtained above, which yielded a general observation of
steric conditions in the ZF-DNA recognition process, an extensive model study of various
AA-base combinations in the critical contact positions was undertaken. The results of this
investigation are presented both as the ZF-DNA reading code and main rules for recognition
(Tables 1, 2 and 3). The rules are in good accordance with crystallography, directed
mutagenesis, DNA-binding and sequence analysis data.

With reference to the sequence of Formula 1 and the 2-dimensional structure diagram
in Figure 2 (which provides a schematic representation of a zinc-finger domain and its
interaction with a DNA strand), the studies confirmed the identity of the three critical contact
positions in a given zinc-finger domain as follows:

1) between the first nucleotide in the triplet and the first AA preceding the
constant histidine at the COOH end of the α-helix;

2) between the second nucleotide in the triplet and the fourth AA preceding the
constant histidine at the COOH end of the α-helix; and,

3) between the third nucleotide in the triplet and the seventh AA preceding the
constant histidine at the COOH end of the α-helix.
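
The three positions listed above can be read as fixed offsets (1, 4 and 7 residues) counted back from the constant histidine. The sketch below picks out the contact residues from a domain sequence and pairs them with the nucleotides of a triplet; the function names and the example fragment are hypothetical, and the caller is assumed to know the index of the constant histidine.

```python
# Residues counted back from the constant histidine at the COOH end of the helix.
CONTACT_OFFSETS = {"Z1": 1, "Z2": 4, "Z3": 7}


def contact_residues(domain, his_index):
    """Return the residues at the three contact positions of one ZF domain."""
    if his_index < max(CONTACT_OFFSETS.values()):
        raise ValueError("fewer than seven residues precede the histidine")
    if domain[his_index] != "H":
        raise ValueError("his_index does not point at a histidine")
    return {name: domain[his_index - off] for name, off in CONTACT_OFFSETS.items()}


def pair_with_triplet(triplet, contacts):
    """Pair nucleotides 1, 2, 3 (5' to 3') with Z1, Z2, Z3 respectively."""
    return list(zip(triplet, (contacts["Z1"], contacts["Z2"], contacts["Z3"])))


# Hypothetical helix fragment used only for illustration.
fragment = "FSRSDELTRHIRIH"
contacts = contact_residues(fragment, fragment.index("H"))
print(contacts)                            # {'Z1': 'R', 'Z2': 'E', 'Z3': 'R'}
print(pair_with_triplet("GCG", contacts))  # [('G', 'R'), ('C', 'E'), ('G', 'R')]
```
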

Steric conditions in the three contact sites of the ZF-DNA recognition complexes are
different. The first contact position is relatively large and strictly fixed, which enables the
binding of a longer AA to bases on the primary DNA strand with sufficient specificity and
affinity. The second position is compressed and can accommodate smaller AA's with

:f , Htv ,„< ,ffi n j n . Th» 'bird position allows considerable conformational 



h nucieouoe or a given DNA tnpiei on trie onmaiv >u<uiu. o..; nui 



(Column A) and alternative (Column B) base-binding AA's are presented. Both specificity
and affinity were considered in including a residue in Column A. As was proposed already by
Seeman et al. [22], the fidelity of recognition is better in the case of purine bases
(guanine and adenine), because they occupy a greater portion of the major groove and offer
more hydrogen bonding sites than the pyrimidines. Therefore, the strongest AA interactions
appeared to be those of arginine, glutamine and asparagine, each binding by two H-bonds to
either guanine or adenine. The affinities of aspartic acid, glutamic acid, asparagine and
glutamine were frequently enhanced by the formation of water bridges between carboxylate or
amide oxygen atoms and DNA backbone phosphodiester oxygen atoms. Although van der
Waals interactions are relatively weak, they can play a certain role in recognition of the
thymine methyl group by hydrophobic AA's (alanine, valine, leucine and isoleucine).

As indicated in Table 1, in many ZF-DNA complexes the base recognition in the
nucleotide triplet of the primary DNA strand occurs not entirely via the primary strand, but by
binding simultaneously to both the primary and complementary strands, or even exclusively to
the complementary strand. Without "help" from the complementary DNA strand, the binding
of critical AA's to nucleotides of the primary DNA chain would be too weak, in the case of
several triplets, to realize the recognition process. All possible AA replacements were tested
for strength of interaction in the Z1-Z3 positions. Domains with fewer than 2 hydrogen bonds
on the primary strand were considered to be unstable.
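
The stability criterion stated above (at least 2 hydrogen bonds on the primary strand) can be written as a small filter. This is a sketch under stated assumptions: the `Contact` record, its field names and the example values are hypothetical and are not taken from Tables 1 to 3.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    position: int   # contact position: 1, 2 or 3
    h_bonds: int    # number of hydrogen bonds formed at this position
    strand: str     # "primary" or "complementary"


def primary_h_bonds(contacts):
    """Count hydrogen bonds made to the primary DNA strand only."""
    return sum(c.h_bonds for c in contacts if c.strand == "primary")


def is_stable(contacts, minimum=2):
    """Apply the rule: fewer than `minimum` primary-strand H-bonds means unstable."""
    return primary_h_bonds(contacts) >= minimum


# Hypothetical domains for illustration.
domain_a = [Contact(1, 2, "primary"), Contact(2, 1, "primary"), Contact(3, 2, "complementary")]
domain_b = [Contact(1, 0, "primary"), Contact(2, 1, "primary"), Contact(3, 2, "complementary")]
print(is_stable(domain_a), is_stable(domain_b))  # True False
```
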

Table 2 presents the ZF AA triplets having the highest affinity for interaction with
corresponding DNA triplets. These ZF triplets contain only the main residues presented in
Column A of Table 1. Table 2 also presents the binding energy components (H-bonds, water
bridges, van der Waals interactions) maintaining the ZF-DNA recognition process in specific
contact regions.



As can be seen from Table 2, the participation of the complementary DNA strand in
the process of ZF binding, combined with the number of interactions (H-bonds, water bridges
and van der Waals interactions) possible in the three contact regions, when optimal
combinations are used, makes it possible to show that a complex formation with all 64 DNA
triplets can be achieved. Table 2 shows that the maximum number of H-bonds, the strongest of
the three types of interactions, is obtained when the first nucleotide of the triplet is guanine or
adenine.



In nucleotide triplets wherein the number of H-bonds possible is less than maximal, the
deficiency is often partially compensated by a significant amount of water-bridging between
critical AA's and the sugar-phosphate backbone.
Even ,„ a. wheretn .he firs, n U deo,,de of .he tnple, „ ,hym,ne. and .he nu.be, of
th e H-bonds ,s .owes, .) .he formauon of .wo H-bonds be,wee„ .he AA ,n ,he Z, pos,t,o„. 
od .he adentne - complementary ,hy™ne ,n ,he tfrd con,a pos„,on, and 2) probably, a 
sme ,e H-bond benveen thym,ne and senne or threomne ,n .he second comae, postuo* mean, 
,„a, even TTN mptas can bind a ZF protein with sufficient affirury 

In any event, to obtain DBP's of the greatest effectiveness, attention should be paid to
having the strongest interactions in the flanking contact points (1 and 3). If weaker
combinations must be used, they would have less effect if positioned in the center contact
point. It is important to note, however, that even weak binding in the contact points is
important for establishing specificity.

Table 3 presents the main ZF AA triplets of Table 2, as well as the alternative AA's
(shown in Column B of Table 1) which would be also expected to provide effective binding to
the respective bases of a given DNA triplet. Table 3 also presents the binding energy
components (H-bonds, water bridges, van der Waals interactions) maintaining the ZF-DNA
recognition process in specific contact regions.
Table 1 - Z1



Codon 21 

Column 



AAC 




AAG 


Ox 


AAT 


0= 


ACC 


Q- 


ACT 


G» 


QAA 


R* 


QAC 


R- 


GAG 


ru 


GAT 




GCC 


R= 


OCT 


R. 


ACA 


0= 


AOG 


Q« 


AGA 


0* 


AGG 





CAA 

GAG 

CAT 

CCC 

CCT 

GCA 
QCG 
GGA 



E* 

E* 

E* 

E* 

E' 

R= 

Rs 

Rs 



H-/K- 
ACC Q = 

ACT Qs 

QQC Rs 
QQG R s 
GCT R« 



CAC 


E* 


OCA 


E* 


COG 


E* 


CGA 


E* 


CGC 


E* 


COG 


E' 


CTJT 


E' 


ATA 


Q* 


ATC 


0« 


ATG 


Q» 


ATT 


Q« 


OTA 


fU 


OTC 


Rs 


OTG 


R> 


QTT 


fU 


TAA 


WLf 


TAG 




TAT 


l«/Lf/V# 


TCC 


»*7Lt 


tct 





ZT 

' Column B 

E */R t~/K ._/fcJ _ tn a* 

R-/K-/E' 
E*/K T - 

E-/R-./K,- 
K-/H,VY f -/Q t - 

K-/HWY,- 

H-/K-/Y-/Q- 

H-/K./Y-/Q' 

H-/K-/Q-/n-/(y./S-/T-) 
H-/K-/Y-/Q- 

R-/K-/N-/E-/0- 
R-/K-/N-/EVD' 
EVRWKt- 
EVR,./ Kt . 

QVN,VD t VR 3 -/K,./Y ? -/S 3 ./T ? . 
OVNVDVRj./Kj- 

OVR : ./K 1 -/fOVNVS-/T-/Y J -) 

QVN^/CVR^/K^/Q^/N,- 
OVRWK,. 

K-/Q-/(H-/Y./N-/S-/T-) 

H-/K-/Y-/Q./N-/S-/T. 

H-/K-/QVNVY,. 

R-/K-/E' 
EVR.-/K-- 

K./QV(M-/Y-/N-) 

H-/K./Y-/QVN* 

H./K-/Y./Q-/N. 

OVR,./K f - 
QVR,./K a - 
0VNVDVR,3/K,- 
9VNVDVR.-/K,- 

QVNVDVRWK^/O,. 
OVD-VR-W^O,- 

EVR.-/K,- 
N-/EVOVR--/K-- 

EVR t -/K t - 
H-/K-/Y-/Q. 
H-/K-/Y-/Q* 
K-/Q-/HT. 

H-/K-/0-/N./Y-- 

R-/KWQ-/V,# 

R-/K-/Q* 
R./K-/Y-/Q^N. 

fWH-/K-/QVNVVf/A# 



Hydrogen W.t.r Hydrophobic 
B£ *W* Contacts Contacts 



6 

$ 
6 
6 

6 

6 

6 

6 
6 
6 
6 



5 

5 

5 

5 

5 

5 

5 

5 

5 

5 

5 

5 

5 
5 
5 

5 
5 
S 

4 

4 



0 

0 
0 
0 

0 

0 

0 

0 
0 
0 
0 



0 
0 
0 

0 
0 

0 

2 
2 
2 
2 



1 
1 

0 
0 
0 
0 

0 
0 
0 

0 

0 

0 
0 
0 
0 



0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 



0 
0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 

0 
0 
0 
0 
0 
0 

0 
0 
0 
0 

0 
0 
0 
CTA E"

CTC E* 

CTG E* 

CTT E' 

TCA lt/L# 
TCG 

TGA ti/Li 

TAC l#/Li/V# 

TCC l,#/L,#/Vi# 

TCG »#/L# 
7GT !# 

TTA l#/L# 
TTC lf/Lf/V# 

TTG 

TTT i#/L# 



R-/K-/Q* 
R-/K-/Q* 
R-/K-/Q' 

R-/K-/Q* 

R-/K-/Hi-/Q,*/N 5 */S,-/T t - 

r-/k-/<t 

R./H-/K-/Q*/N"/Lf 



R-/K-/QVN* 
RWK-/Y-/Q* 
R./K-/Q*/NW,# 



3 
3 

3 
3 
3 
3 
3 

3 
3 

3 
3 

2 
2 
2 

2 



0 
0 

0 
0 

0 
0 
0 

0 



2 
2 
2 
2 
Legend

where / separates alternative amino acids
where X without subscript has all its interactions with the primary strand

wtiere X, has some interactions with the primary strand and some Interactions with the 
complementary strand " ine 

where X, has interaction with the complementary strand 
where X,has Interactions with both the primary and complementary strands 
where - is one hydrogen bond between the amino acid and the base
where = is two hydrogen bonds between the amino acid and the base

™ ^oft^S ^ 3 Wttef bf,d9e betWeCn """" « «" P-sphodiester 
where # is one or more van der Waals contacts between the amino acid and the base

where amino acids in ( ) have interaction with the base of the primary strand where one of two
other possible protein-DNA recognition interactions is absent
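
As an illustration of how the legend notation can be read mechanically, the sketch below splits a table cell on "/" and converts the trailing "-" or "=" into a hydrogen-bond count. It handles only those two symbols; any other trailing symbol is returned unchanged with a count of zero, and strand subscripts are ignored. The function name `parse_cell` and the sample cells are hypothetical.

```python
# Hydrogen bonds implied by the trailing symbol, per the legend:
# "-" is one hydrogen bond, "=" is two hydrogen bonds.
H_BONDS = {"-": 1, "=": 2}


def parse_cell(cell):
    """Split a cell such as 'H-/K-/(Y-/S-)' into (amino_acid, symbol, h_bonds) tuples.

    Parentheses are stripped and subscripts are not interpreted.
    """
    parsed = []
    for item in cell.split("/"):
        item = item.strip().strip("()")
        if not item:
            continue
        amino_acid = item[0]
        symbol = item[-1] if len(item) > 1 else ""
        parsed.append((amino_acid, symbol, H_BONDS.get(symbol, 0)))
    return parsed


print(parse_cell("R="))             # [('R', '=', 2)]
print(parse_cell("H-/K-/(Y-/S-)"))  # [('H', '-', 1), ('K', '-', 1), ('Y', '-', 1), ('S', '-', 1)]
```
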
Table 1 - Z2



Codon 22 



Z2 



Column A 


Column B 


AAC 


Gt- 


N=/DVS-/T-/R,-/K,-/E,*/(H-/Y-) 


AAu 




R-/H-/K-/EVDVK t - 


AAT 




K-;G=/E*/D*/R 1 -/Hf/K,-/Q>«/N,e 


ACC 




D*/S-/T-/N,=/(K t -) 


ACT 


N S WD* 


QVN*/E*/S-^T-/K,-/Qt« 


GAA 


* * 


OVRWH^/KWYWOiWEiVKj- 


GAC 




D"/Rt-/H t -/K 1 -/Y 1 -/Qi=/Et*/K 2 - 


GAG 


N = 


Q S /EVDVR ,-/H ,-/Kf/Yi-/Ka- 


OAT 


Qs/N = 


KWE*/D*/K,-/(R-/H-/Y-) 


/grr* 




Q # /N*/D # /S-/T-/R,JH l -/K 7 -/Qr' N t*' s a- /T i- 


GCT 




Q./N-/E-/S-rr-/H,-/K 1 -/N f */Q|s/(Rt*/Ys-> 


ACA 


D* 


QVNVE'/S-/T-/K,- 


ACG 


E*/D* 


Q , /N*/S-/T-/R,-/Hi-/K J -/Y,-/G,=/N 1 = 


AGA 


N * 


R ,-/H,-/Ki-/Y,-/Qi* 


AGG 


Q*/N 


rV/H,*/Ki- 


CAA 


N- 


D*/S-rr-/Ri-/H,-/K,-/Y,-/Q,=/Ei- 


GAG 


0= 


N=/EVDVR,-/K,-/K,-/Qj= 


CAT 


0,=/N = 


D-^S-/T-/R,-/H 1 ./IC,-/Y 1 -/E,s/K l -/Q>=/N,=: 


CCC 




NVOVS-rr-/H i ./K J -/N,=/(Y 2 -) 


CCT 


N,=/D* 


Q /N /E /S*/T'/n^/l\|-/U|» 


GCA 


0* 


QVN*/E VS-/T-/R j*/H l -/Kr/Ti'/u 3 -/rij= 


UUu 


c # u 


Q*/N /S-/T-/K I -/G|«/N 3 »/(ni-) 


GGA 


N* 


avs-rr-/K,-/(R-/Y-/H-) 


AAA 


N= 


0*/Ri-/H,JKi7Yi JQ,=/E, 




u . 
rh* 


a*/N*/S*/T-/Rt-/Kt-/(n=/T-J 


ACT 


H,- 


QVN*/S-/T-/R,=/Ki- 


GGC 


Hr 


NVS-rr-/K 1 -^(R-/K-/Y-/Q"] 


GGG 


H- 


K-/QVN*/S-/T-/Y,-/(R-) 


GGT 


H- 


QVNVS-/T-/RWK,- 


GAC 


N= 


D*/Ri-/K t -/Gi«/E,* 


CCA 


D* 


NVS-/T-/Q,VE 1 VK 1 -/N I « 


COG 


D* 


Q*/NVE*/S-/T-/K,*/Q,«/N,»/(R-/H-/Y-) 


GGA 


W 


QVS-/T^Ri«/H,-/K,-/(Y-) 


GGC 


H,. 


QVN-/S-/T-/K,-/(Ri-/Y-) 


COG 


H- 


K-/Q*/NVRi-/(Y-) 


CGT 


H- 


Q*/NVRi-/K,-/(Y-) 


ATA 


ll/L#/V#/A# 


s-rr-/K,-/Q,VN 1 v(R-/H-nr-) 


ATC 


Iff/Lf 


NVS-/T-/V-/Rt-/H,-/K 1 -/Y 1 -/Q,VR J -/K, 


ATG 


1#/L#/Vf 


s-rr-/E,./o,-/Q,«/N f -/(R-/H-/K-nr-) 


»r 




n ?h ./K-/V./Q VN*/S-/T 




It/L#/V» 






If /L t/Vf 


Q*/N"/S*/T-/Ki-/(H-/M-/T-/A-y 


TAA N> 


DVRWHWKt-nrwawEi- 



Hydrogen Water Hydrophobic 
Bonds Contacts Contacts 



6 

6 

6 

6 

6 

6 

6 

6 

6 

6 

6 

5 

5 

5 

5 

S 

5 

5 

5 

5 

5 

5 

5 

5 
5 
5 
5 
5 
5 



0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 



0 
0 
0 
0 

0 
0 

2 
2 
2 
2 



1 

1 

0 
0 
0 

0 



0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

c 

0 
0 

0 

0 
0 
0 
0 
0 
0 

0 
0 
0 
0 

0 
0 
0 

1 

1 
1 
1 
Table 1 - Z2

Codon  Column A      Hydrogen  Water     Hydrophobic  Column B
                     Bonds     Contacts  Contacts
TAG    N=            4         0                      0=/EVDVRWKWK,-
TAT    Ns            4         0                      K-/Q=/N=/E'/DVS-rr-/H 1 ./H 7 ./K l -/Q J «/N J = S /(R-/Y-)
TCC    Q,-/E*        4         0                      Q*/NVD*/S-/T./R I -/H I -/K,./Y,-/Q I ./N 1 -
TCT    N,*/0"        4         0                      QVNVEVS./T./R.r/HWKr/YWO,*
CTA    I#/L#/V#/A#   3                                S-/T-/Q,=/N t W(H-/K-)
CTC    lt/L#         3                                S-/T*/V-/R 1 */H 1 -/K 1 -/Y 1 -/O f '/N,*
CTG    lff/L#/V*     3                                NVS-/T-/KI-/Q,'
CTT    ita#          3                                QVNVS-/T-/V#/R 1 -/H 1 ./K 1 -/E l ./D a -/Q,=/N,«/(Y.)
TCA    D*            3                                QVNVEVS-rr-/R,«/H J -/K,-/Y 1 -/Q J «/N,»
TCG    D*            3                                QVNVEVS-/T-/R 1 «/H 1 -/K,-/Y j -/S,-/T,-/Q,b/N,«
TGA    N*            3                                Q'/S-rr-/H,-/K,-/(R-/Y-)
TAC    N =           3         0                      D./H,-/KWQt=/E,*/K J -/(R./Y-)
TGC    H r           3         0                      Q*/NVS-rr-/R,=/K,-/Y,-
TGG    H-            3         0                      R-/K./QVNVY t -
TGT    H r           3         0                      NVS-/T-/K,./Y t -/Q,*/(Rr)
TTA    l#/L#/V#/A#   2         0         2            S-/T-/R 1 -/H 1 -/K,-/Y ( -/E I -/Di-/Q > -/N,»
TTC    l#/L#         2         0         2            NVS-/T./V-/AWR,-/H,-/K,-/Y,-/K,-/0,«/N,.
TTG    l#/L#/V*      2         0         2            NVS-/T-/H,-/K,-/Q,*
TTT    l#/L#/V#      2         0         2            QVNVS-/T-/A-/H,-/K 1 -/fR-/Y-\
Legend

where / separates alternative amino acids
where X without subscript has all its interactions with the primary strand
where X, has some interactions with the primary strand and some interactions with the
complementary strand
where X, has interaction with the complementary strand
where X, has interactions with both the primary and complementary strands
where - is one hydrogen bond between the amino acid and the base
where = is two hydrogen bonds between the amino acid and the base
where * is one hydrogen bond via a water bridge between the amino acid and the phosphodiester
oxygen atom of the backbone
where # is one or more van der Waals contacts between the amino acid and the base
where amino acids in ( ) have interaction with the base of the primary strand where one of two
other possible protein-DNA recognition interactions is absent
Table 1 - Z3

Codon  Column A  Hydrogen  Water     Column B
                 Bonds     Contacts
AAC              6         0         Y-fO /N , /E , /D*fH,-/K ? -/Y T -/0 j*/N,
AAG    fU        6         0         H-/K-/Y-/QWO j*/Ei-/N2*/D, '/S,-Mj-
AAT    Oj-       6         0         R-/H-/K-/Y-/Q*/fi,7K»-/Q>^N»^E» /Dj /Oj*
ACC    "t*       6         0         Q'/E*/K T -/N 1 VS|-/Tj-
ACT    Or-       6         0         R-/M-/K-/Y-/Q*/R,-/H,-/ICi-/Yj-/N,«/E» IV,
GAA    Q-        6         0         R-/H-/K-/EVRj-/H r -/K I 'VY T -/Oi
GAC    Qt-/E*    6         0         Q /R H j-/K j-/T j -/Q j*/ N » / S jr/Tf-
GAG    Hs        6         0
GAT              6         0         R-/H*/K-/Y-/a /N /l-/Rj-/nr-/K|*tT i-/Nj«/t| /U|
GCC              6         0
GCT              6         0         R./K./Y-/Q/R 1 ./H r /K 1 -rY,-/N,VE,*/D 1 */S r /T I -
ACA    G=                            [illegible]
ACG    R«        5                   H -/ (V. • / T -/ U /K /S-M-ZCj /Uj /Q|S/N|S
AGA    Qs        5                   E-/H f/H f/K i*/ • J-/Q j-/N , /l]l/L 9 ff /»jf f Af ff
AGG    Rx        r 3                 tWY-/Qi /N| /EfVDj
CAA    Q,-       5                   H-/H-/K-/Y-/0=/E /R j-/Hj*/Kf-/T}-/U|
CAG              5                   H-/K-/Y-/a-/N-/Q,VN,VE,*/0»VS J -rr,-
CAT              5         1         R-/H-/K-/Y./OVD,./E r
CCC    «v        5         1         QVE*/H,./K T -/Y t -/Q,VN,*/Si-/T,-
CCT              5         1         R-/K-/K-/Y-/QVH t ./K,-nf,-/N,"/E»*/S t -/T r
GCA              5                   R-/M.yK-/Y-/E*/R,-/H r /K,-/Y,./Q I '/Nj*/S,-rr,-/l,#/L,f/V,l/A J l
GCG    rt=
GGA    Q*        5         1         NWEVDVR,-/H J ./K r -/Y,-/QfVN 1 */I,i/L,l/V J t/A,«
AAA    Q=        5         0         n^K-/N=/E-/D-/F T -/H,WK t -/Y,-/Q 1 -/N,-;i 1 in- t #
AGC              c 3       J         W /rt /t /U / rl}>/f^f •/ T f *f Uj / ri j
AGT              5         0         R-/K-/YWQVN*/S-rr-/lf/Lt/H,-/K t -/Y 1 -/N,= /E2 - /D,'
GGC              5         0         aVNVEVO"/H I WK,-nr,-/Q,*/N,'/S,-/T I -
GGG    R=        5         0         H-/K-/Y-/QVNVE,VD,VS r rr I ./Q,=/N,«
GGT                        0         R-/K-/aVNVK,-/E,*/D,*/S,-/T,-
CAC    E*        4                   q*/r,k/h,-/k,-/y,-/o,*
CCA    Qs        4                   EVR,-/H,WK,-nr,-/Q 2 VN 1 VS I -rr,-/l,*/L,t/V t t/A,#
CCG    R=        4         2         H-/K./Q-/N-/E,*/D t VQWN 1 =
CGA    Q=        4         2         R-/H.yK./Y-/NWE*/0VR I -/H,-/K t -/Yj-/Q 1 '/(N,*/S,-/T f ./1,*/L t i;v 1 J/A t l)
CGC    Bj-       4                   QVNVE , /0 , /H,-/K f -/Y 1 -/Qf-/N 1 -
CGG    R=                            H-/K-rr-/Q , /E 1 -/D t «/s ? -rr t -/Q,WM t «
Table 1 - Z3

Codon  Column A
CTG    fU
CTT    Ot-
TCA    0 =
TCG    R=
TGA    0*
TAC    E-
TGC    Ri-
TGG    Rs
TGT    0*.
TTA    0 =
TTC    R,-/E*
TTG    fU
TTT

Column B
H -/K-/Y-/Q , /Q,VN r /E 1 VO,-/Sj*n" r
QVNVEVD*/H f -/K 1 -/Y T -/Q> , ' N »*
H-/K-/Y-/QWN,-/E r /D,-/&,-/T r
R-/H-/K./Y./EVRWH ) -/K 1 -/Y t ./NWSr/T 1 ./l,l/t I l/V,l/A,i
Q , /NVDVH ) WK t -/Yi-VOWNi-/S,-Tt-
H-/K*/Y-/Q-/a t -/M, ; /Et*'Cr
R./H-/K-rr/R r /H J -/K,-/VWQ I =/N, S yE»-/DWN.=
Legend

where / separates alternative amino acids
where X without subscript has all its interactions with the primary strand
where X, has some interactions with the primary strand and some interactions with the
complementary strand
where X, has interaction with the complementary strand
where X, has interactions with both the primary and complementary strands
where - is one hydrogen bond between the amino acid and the base
where = is two hydrogen bonds between the amino acid and the base
where # is one or more van der Waals contacts between the amino acid and the base
where amino acids in ( ) have interaction with the base of the primary strand where one of two
other possible protein-DNA recognition interactions is absent
Table 2

Codon  Z1        Z2        Z3        Hydrogen  Water     Hydrophobic
       Column A  Column A  Column A  Bonds     Contacts  Contacts
AAC    Q         Q         R         6         0         0
AAG    Q         N/Q       R         6         0         0
AAT    Q         N         Q         6         0         0
ACC    Q         E/Q       R         6         0         0
ACT    Q         D/N       Q         6         0         0
GAA    R         N         Q         6         0         0
GAC    R         N         E/Q       6         0         0
GAG    R         N         R         6         0         0
GAT    R         N/Q       Q         6         0         0
GCC    R         E/Q       R         6         0         0
GCT    R         D/N       Q         6         0         0
ACA    Q         D         Q         5         1         0
ACG    Q         D/E       R         5         1         0
AGA    Q         N         Q         5         1         0
AGG    Q         N/Q       R         5         1         0
CAA    E         N         Q         5         1         0
CAG    E         Q         R         5         1         0
CAT    E         N/Q       N/Q       5         1         0
CCC    E         E/Q       R         5         1         0
CCT    E         D/N       Q         5         1         0
GCA    R         D         Q         5         1         0
GCG    R         D/E       R         5         1         0
GGA    R         N         Q         5         1         0
AAA    K/R       N         Q         5         0         0
AGC    Q         H         R         5         0         0
AGT    Q         H         Q         5         0         0
GGC    R         H         R         5         0         0
GGG    R         H         R         5         0         0
GGT    R         H         N/Q       5         0         0
CAC    E         N         E         4         2         0
CCA    E         D         Q         4         2         0
CCG    E         D         R         4         2         0
CGA    E         N         Q         4         2         0
CGC    E         H         R         4         1         0
[illegible]
ATA    Q         A/I/L/V   Q         4         0         1
ATC    Q         I/L       E/R       4         0         1

WO 99/42474 



PCT/US99/03692 



Codon  Z1        Z2
ATG    Q         I/L/V
ATT    Q         I/L/V
GTA    R         A/I/L/V
GTC    R         I/L
GTG    R         I/L/V
GTT    R         I/L/V
TAA    I/L       N
TAG    I/L       N
TAT    I/L/V     N
TCC    I/L       E/Q
TCT    I/L       D/N

Z3     Hydrogen  Water
       Bonds     Contacts
R      4         0
Q      4         0
E/R    4         0
R      4         0
Q      4         0
Q      4         0
R      4         0
Q      4         0
R      4         0
N/Q    4         0

Codon  Z1        Z2        Z3        Hydrogen  Water     Hydrophobic
                                     Bonds     Contacts  Contacts
CTA    E         A/I/L/V   Q         3
CTC    E         I/L       E/R       3
CTG    E         I/L/V     R         3
CTT    E         I/L       Q         3
TCA    I/L       D         Q         3
TCG    I/L       D         R         3
TGA    I/L       N         Q         3
TAC    I/L/V     N         E         3         0
TGC    I/L/V     H         R         3         0
TGG    I/L       H         R         3         0
TGT    I         H         Q         3         0
TTA    I/L       A/I/L/V   Q         2         0         2
TTC    I/L/V     I/L       E/R       2         0         2
TTG    I/L       I/L/V     R         2         0         2
TTT    I/L       I/L/V     Q         2         0         2

where / separates alternative amino acids
The results of the molecular modeling analysis of various ZF α-helix complexes with
the 64 different DNA triplets (Tables 1, 2 and 3), and the findings of spatial peculiarities in the
three contact positions, are reflected in the ZF-DNA recognition rules. On the basis of the
rules set forth in Tables 1, 2 and 3, DBP's with optimal binding affinity for any target DNA
sequence can be designed. The "Column A" designations, i.e., the "A Rules," in Tables 1-3,
show the amino acids with optimal binding for a given codon (triplet). The "Column B"
designations, i.e., the "B Rules," in Tables 1 and 3, show the amino acids with secondary, but
still significant, binding affinity for a given triplet.

The column A rules range from the strongest triplet recognition, with six H-bonds, zero
water contacts and zero hydrophobic contacts and an evaluated energy of (5x6) + (2x0) +
(1x0) = 30, to two hydrogen bonds, zero water contacts and two hydrophobic contacts, with an
evaluated energy of (5x2) + (2x0) + (1x2) = 12. The column A rules ordinarily have a choice
of just one or two amino acids in positions Z1, Z2 and Z3. The column B rules, by
comparison, have from three possible amino acids in each of the Z1, Z2 and Z3 positions to as
many as eighteen amino acids in different contacting arrangements in each of the Z1, Z2 and Z3
positions. In the evaluation of the column B energies, there are a large number of different
groupings of three amino acids in positions Z1, Z2 and Z3. The minimum energy is three
hydrogen bonds, zero water contacts and zero hydrophobic contacts, with an evaluated energy
of (5x3) + (2x0) + (1x0) = 15. The maximum energy evaluation for these combinations is, on
average, three hydrogen bonds and either two water contacts or two hydrophobic contacts,
with an evaluated energy of from (5x3) + (2x2) + (1x0) = 19 down to (5x3) + (2x0) + (1x2) =
17. Thus the column B rules have a narrower energy range (i.e., from 19 down to 15) than
do the column A rules, which have an energy range from 30 down to 12. The narrow energy
range for the column B rules means that the 64 different column B rules do not distinguish on
the basis of energy as well as the 64 column A rules.
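
The energy evaluation quoted above can be checked with a few lines of Python. This is a minimal sketch: the function name `domain_energy` and its parameter names are illustrative assumptions, and only the weights 5, 2 and 1 are taken from the text.

```python
def domain_energy(h_bonds: int, water_contacts: int, hydrophobic_contacts: int) -> int:
    """Evaluated energy of one triplet recognition.

    Weights follow the text: 5 per hydrogen bond, 2 per water contact,
    1 per hydrophobic contact.
    """
    return 5 * h_bonds + 2 * water_contacts + 1 * hydrophobic_contacts


# Column A extremes quoted in the text.
strongest_a = domain_energy(6, 0, 0)
weakest_a = domain_energy(2, 0, 2)

# Column B bounds quoted in the text.
b_min = domain_energy(3, 0, 0)
b_max = domain_energy(3, 2, 0)
```

With these weights a single hydrogen bond outweighs two water contacts, which is why the column A rules with six hydrogen bonds sit at the top of the range.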

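For example, as set forth in Table 2, a DBP which binds optimally to the DNA base
triplet guanine-cytosine-cytosine (GCC) is one wherein the portion of the protein responsible
for the binding to the triplet is a ZF domain within which is contained a segment having the
sequence Z3XXZ2LXZ1H, wherein Z1 is an arginine which interacts with position 1 of the DNA
triplet; Z2 is glutamic acid or glutamine, which interacts with position 2 of the DNA
triplet; Z3 is an arginine which interacts with position 3 of the DNA triplet; X is an arbitrary
amino acid; L is leucine and H is histidine.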




As set forth in Table 1 or 3 (see the "Column B" entries for the Z1, Z2, and Z3 portions
for a given codon), a DBP which effectively, if not optimally, binds to the DNA base triplet
guanine-cytosine-cytosine (GCC) is one wherein the portion of the protein responsible for the
binding to the triplet is a ZF domain within which is contained a segment having the sequence
Z3XXZ2LXZ1H, wherein Z1 is an amino acid selected from the group consisting of histidine,
lysine, glutamine, asparagine, tyrosine, serine and threonine which interacts with position 1 of
the DNA triplet; Z2 is an amino acid selected from the group consisting of glutamine,
asparagine, aspartic acid, serine, threonine, arginine, histidine, and lysine which interacts with
position 2 of the DNA triplet; Z3 is an amino acid selected from the group consisting of
glutamine, asparagine, glutamic acid, aspartic acid, histidine, lysine, tyrosine, serine and
threonine which interacts with position 3 of the DNA triplet; X is an arbitrary amino acid; L is
leucine and H is histidine.

It will be appreciated, of course, that DBP's of intermediate affinity, i.e., wherein
the Z1, Z2 and Z3 contact amino acids are selected according to a combination of the "A" and
"B" rules, can be designed. For example, in the segment Z3XXZ2LXZ1H within a ZF domain
binding to the triplet GCC, Z1 could be an arginine, Z2 could be a glutamic
acid, and Z3 could be selected from the group consisting of glutamine, asparagine, glutamic
acid, aspartic acid, histidine, lysine, tyrosine, serine and threonine.

The basic building block for such proteins is denoted by the formula

NH2-ZiFc-COOH,

where ZiFc is a ZF domain of the form

Y/FXCX2-4CG/DK/RXFXZ3XXZ2LXZ1HX3-5H, where

Z1, Z2 and Z3 are amino acids chosen from Table 1, 2 or 3 to correspond to the three
bases of the DNA triplet, and the remaining components of the formula are as described earlier
in the description of Formula I.

In the preferred embodiment of the invention, a zinc-finger domain for binding to a
given DNA triplet is designed by selection of the appropriate AA's in Table 2 or in column A
of Table 1 or Table 3. In another embodiment of the invention, the ZF domain is designed by
selection from among the AA's set forth for a given DNA triplet in column B of Table 1 or 3.

One such domain is required for each triplet of the target sequence; for a target string 
of only 3 bases, the above formula defines the protein. 
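
The column A lookup for a single triplet can be sketched as follows. This is an illustration only: `TABLE_2_EXCERPT` copies just four rows of Table 2, and the helper name `contact_segment` and its rule of taking the first listed alternative are assumptions of this sketch, not part of the disclosed method.

```python
# Excerpt of Table 2 (column A rules): codon -> (Z1, Z2, Z3).
# Alternatives are separated by "/" as in the table.
TABLE_2_EXCERPT = {
    "AAC": ("Q", "Q", "R"),
    "GAG": ("R", "N", "R"),
    "GCC": ("R", "E/Q", "R"),
    "GGG": ("R", "H", "R"),
}


def contact_segment(codon: str, table=TABLE_2_EXCERPT) -> str:
    """Return the segment Z3-X-X-Z2-L-X-Z1-H for one DNA triplet.

    Where the table offers alternatives (e.g. "E/Q"), the first one is
    used; that tie-break is an assumption made for this sketch.
    """
    z1, z2, z3 = (choice.split("/")[0] for choice in table[codon])
    return f"{z3}XX{z2}LX{z1}H"


gcc_segment = contact_segment("GCC")
```

The X positions are left as the placeholder letter X because the text allows any amino acid there.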

If the target string of DNA is 6 bases, the DBP design is extended as follows:

NH2-ZiF1-{linker}-ZiF2-COOH

where ZiF1 and ZiF2 are ZF domains designed, as shown above for ZiFc, to bind to the first
and second triplets of the six bases, and {linker} is an amino acid sequence conforming to the
pattern

T/SG/EX0-2EK/RP,

again wherein the components are as defined previously in Formula I.

If 1) the target string of DNA contains 9, 12, or a higher multiple of 3 bases; 2) it is
required to design a DBP for 3n+3 bases; and 3) the DBP for the first 3n bases is given by the
sequence:

NH2-ZiF1-{linker}-ZiF2-{linker}- ... -{linker}-ZiFn-COOH

then the DBP design is extended recursively and the required DBP is specified by the
sequence:

NH2-ZiF1-{linker}-ZiF2-{linker}- ... -{linker}-ZiFn-{linker}-ZiFn+1-COOH

where ZiFn+1 is a ZF domain designed, as shown above for ZiFc, to bind to the (n+1)th triplet
of the target sequence of base pairs.
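
The extension from one finger to n fingers can be sketched in Python. The helper names `split_triplets` and `dbp_schematic` are hypothetical; the sketch only produces the schematic string of domains and linkers and does not choose any amino acids.

```python
def split_triplets(target: str) -> list[str]:
    """Split a target DNA string into consecutive triplets."""
    if len(target) == 0 or len(target) % 3 != 0:
        raise ValueError("target length must be a positive multiple of 3")
    return [target[i:i + 3] for i in range(0, len(target), 3)]


def dbp_schematic(target: str) -> str:
    """One ZiF domain per triplet, joined by {linker} placeholders."""
    n = len(split_triplets(target))
    fingers = [f"ZiF{i}" for i in range(1, n + 1)]
    return "NH2-" + "-{linker}-".join(fingers) + "-COOH"


six_base_design = dbp_schematic("GCCAAC")
```

Adding one more triplet to the target appends one `{linker}` and one further ZiF domain, which mirrors the recursive step from 3n to 3n+3 bases.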
Figure 3 provides a schematic representation of a ZF protein wherein n=3, i.e., one
which has 3 ZF domains connected by linker sequences and is designed to bind to a
target DNA string of 9 (3n) bases.

The above rules enable ready determination of the optimal amino acid(s) for binding to
any given DNA triplet and thus the identification and positioning of the 3 amino acids in a ZF
domain which would be the ideal component of a DBP for binding to the DNA triplet.

The application of the rules can then be extended to design of a DBP containing a set
number, nd, of ZF domains, which DBP binds to a target stretch of 3nd nucleotides within a
given DNA sequence. The target 3nd stretch of nucleotides, and the collection and order of nd
domains in the DBP, are such that the binding energy for the DBP and target DNA sequence
is the highest possible for any pairing of a DBP containing the set number, nd, of ZF domains
with any stretch of 3nd nucleotides within the entire DNA molecule being screened.

Accordingly, the embodiment of the invention of primary importance is a method for
designing such a DBP for a DNA sequence of any length. The method employs the rules
disclosed above in combination with a means of screening and ranking all possible segments of
3nd nucleotides within the sequence by their affinities for DBP's containing nd ZF domains to
determine a unique DBP with the desired properties.

More particularly, the invention is directed to a method for designing a DBP, with
multiple ZF domains connected by linker sequences, that binds selectively to a target DNA
sequence within a given gene, each of said ZF domains having the formula

A1XCX2-4CA2A3XFXZ3XXZ2LXZ1HX3-5H

and each of said linkers having the formula

A4A5X0-2EA6P,

wherein

(i) X is any amino acid; (ii) X2-4 is a peptide from 2 to 4 amino acids in length; (iii) X3-5 is a
peptide from 3 to 5 amino acids in length; (iv) X0-2 is a peptide from 0 to 2 amino acids in
length; (iv) A1 is selected from the group consisting of phenylalanine and tyrosine; (v) A2 is
selected from the group consisting of glycine and aspartic acid; (vi) A3 is selected from the
group consisting of lysine and arginine; (vii) A4 is selected from the group consisting of
threonine and serine; (viii) A5 is selected from the group consisting of glycine and glutamic
acid; (ix) A6 is selected from the group consisting of lysine and arginine; (x) C is cysteine;
(xi) F is phenylalanine; (xii) L is leucine; (xiii) H is histidine; (xiv) E is glutamic acid; (xv)
P is proline; and (xvi) Z1, Z2 and Z3 are the base-contacting amino acids, comprising the
steps of:

(a) setting a genome to be screened;

(b) selecting the target DNA sequence in the genome for binding;

(c) setting the number of zinc-finger domains to nd;

(d) dividing the target DNA sequence into nucleotide blocks wherein each block
contains nz nucleotides using a first routine where nz is determined using the
following relationship:

nz = 3nd;

(e) assigning base-contacting amino acids at Z1, Z2 and Z3 to each ZF domain,
according to the A Rules and/or B Rules set forth in Tables 1-3, of a DBP which binds to the
first nucleotide block from step (d) as numbered from the first 5' nucleotide of the target gene
sequence to generate a block-specific DBP and calculating the binding energy, Binding Energy
block, of each ZF domain of each such block-specific DBP as the product of the binding
energies, Binding Energy domain, of all zinc-finger domains of the polypeptide, each determined
using the formula:

Binding Energy domain = (5 x the number of hydrogen bonds) + (2 x the
number of H2O contacts) + (the number of hydrophobic contacts);

(f) subdividing the DBP from step (d) into blocks using a second routine to generate a
subdivided DBP having three ZF domains;

(g) screening the subdivided DBP from step (f) against the genome using a third
routine to determine the number of binding sites in the genome for each subdivided
DBP in the genome and assigning a binding energy for each such site using the
following formula:

Binding Energy site = (5 x the number of hydrogen bonds) + (2 x
the number of H2O contacts) + (the number of hydrophobic contacts);

(h) calculating a ratio of binding energy, Rb, using a fourth routine for each nucleotide
block:

Rb = Binding Energy block / the sum of all Binding Energy site's for all subdivided
DBP's from step (g);

(i) repeating steps (f) through (h) for each subdivided DBP wherein nd > 4;

(j) repeating steps (d) through (i) for each nucleotide block in the target DNA
sequence containing nz nucleotides;

(k) rank-ordering Rb numerical values obtained from step (h); and

(l) selecting a DBP with an acceptable Rb value.
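
The two quantities at the heart of these steps, the block energy as a product of domain energies and the ratio Rb, can be sketched as below. The function names are illustrative assumptions, and the site energies in the example are made-up numbers used only to exercise the arithmetic; they are not results from the disclosure.

```python
from math import prod


def domain_energy(h_bonds: int, water: int, hydrophobic: int) -> int:
    """Binding energy of one ZF domain (weights 5, 2, 1 as in step (e))."""
    return 5 * h_bonds + 2 * water + hydrophobic


def block_energy(domains: list[tuple[int, int, int]]) -> int:
    """Binding energy of a block: product of its domain energies."""
    return prod(domain_energy(*d) for d in domains)


def binding_ratio(block: float, site_energies: list[float]) -> float:
    """Rb = block energy / sum of the energies of all subsite binding sites.

    Raises ZeroDivisionError when no competing site is supplied.
    """
    return block / sum(site_energies)


# Three fingers, each making six hydrogen bonds: 30 * 30 * 30.
three_finger_block = block_energy([(6, 0, 0)] * 3)
# Hypothetical competing site energies, for illustration only.
example_rb = binding_ratio(three_finger_block, [900, 100])
```

Because the block energy is a product while the denominator is a sum, lengthening the DBP raises the numerator much faster than adding competing sites raises the denominator.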

Preferred embodiments of this aspect of the invention are: 

1) the design method as set forth above wherein the DBP Rb numerical value is the
highest numerical value for all DBP's in step (h) that bind to the target DNA sequence.

2) the method above wherein the DBP numerical value determined in step (h) is at
least 10,000.

3) the method above wherein the number of ZF domains, nd, is nine.

4) the method above wherein the rules for assigning base-contacting amino acids at
Z1, Z2 and Z3 for each nucleotide block in step (e) are selected from rule set A.

The invention is further directed to a computer system for designing a DBP, with 
multiple ZF domains connected by linker sequences, that binds selectively to a target DNA 
sequence within a given gene, each of said ZF domains having the formula 

A1XCX2-4CA2A3XFXZ3XXZ2LXZ1HX3-5H

and each of said linkers having the formula

A4A5X0-2EA6P,

wherein

(i) X is any amino acid; (ii) X2-4 is a peptide from 2 to 4 amino acids in length; (iii) X3-5 is a
peptide from 3 to 5 amino acids in length; (iv) X0-2 is a peptide from 0 to 2 amino acids in
length; (iv) A1 is selected from the group consisting of phenylalanine and tyrosine; (v) A2 is
selected from the group consisting of glycine and aspartic acid; (vi) A3 is selected from the
group consisting of lysine and arginine; (vii) A4 is selected from the group consisting of
threonine and serine; (viii) A5 is selected from the group consisting of glycine and glutamic
acid; (ix) A6 is selected from the group consisting of lysine and arginine; (x) C is cysteine;
(xi) F is phenylalanine; (xii) L is leucine; (xiii) H is histidine; (xiv) E is glutamic acid; (xv) P
is proline; and (xvi) Z1, Z2 and Z3 are the base-contacting amino acids, comprising the steps
of:

(a) setting a genome to be screened;

(b) selecting the target DNA sequence in the genome for binding;

(c) setting the number of ZF domains to nd;

(d) dividing the target DNA sequence into nucleotide blocks wherein each block
contains nz nucleotides using a first routine where nz is determined using the
following relationship:

nz = 3nd;

(e) assigning base-contacting amino acids at Z1, Z2 and Z3 to each ZF domain,
according to the A Rules and/or B Rules set forth in Tables 1-3, of a DBP which binds to the
first nucleotide block from step (d) as numbered from the first 5' nucleotide of the target gene
sequence to generate a block-specific DBP and calculating the binding energy, Binding Energy
block, of each ZF domain of each such block-specific DBP as the product of the binding
energies, Binding Energy domain, of all domains of the DBP, each determined using the formula:

Binding Energy domain = (5 x the number of hydrogen bonds) + (2 x the
number of H2O contacts) + (the number of hydrophobic contacts);

(f) subdividing the DBP from step (d) into blocks using a second routine to generate a
subdivided DBP having three ZF domains;

(g) screening the subdivided DBP from step (f) against the genome using a third
routine to determine the number of binding sites in the genome for each subdivided
DBP in the genome and assigning a binding energy for each such site using the
following formula:

Binding Energy site = (5 x the number of hydrogen bonds) + (2 x
the number of H2O contacts) + (the number of hydrophobic contacts);

(h) calculating a ratio of binding energy, Rb, using a fourth routine for each nucleotide
block:

Rb = Binding Energy block / the sum of all Binding Energy site's for all subdivided
DBP's from step (g);

(i) repeating steps (f) through (h) for each subdivided DBP wherein nd > 4;
(j) repeating steps (d) through (i) for each nucleotide block in the target DNA
sequence containing nz nucleotides;

(k) rank-ordering Rb numerical values obtained from step (h);

(l) selecting a DBP with an acceptable Rb value.

According to the instant invention, Rb, as defined in (h) above for both the design
method and computer system, has a lower limit of 10,000. Preferably, Rb is greater than 10^6.
Preferred embodiments of this aspect of the invention are:

1) the computer system as set forth above wherein the DBP Rb numerical value is the
highest numerical value for all DBP's in step (h) that bind to the target DNA sequence.

2) the computer system above wherein the DBP numerical value determined in step
(h) is at least 10,000.

3) the computer system above wherein the number of ZF domains, nd, is nine.

4) the computer system above wherein the rules for assigning base-contacting amino
acids at Z1, Z2 and Z3 for each nucleotide block in step (e) are selected from rule set A.

The method and computer system of the instant invention are further illustrated by the 
block flow diagrams of Figures 4-9. 

Figure 4 shows the components of the computer system on which the DBP design 
process is implemented. A Central Processor Digital Computer (1) of any manufacture is 
provided with a Computer Program (2) written by the inventors. This Computer Program (2) 
reads a series of files described as DNA-Triple Energy Rules (6), Genome Descriptors (9), 
Genomic DNA Sequence (10) and Gene Features (5). The Central Processor (1) transforms 
this information into the DBP Blocking Fragment Files (7) and the Optimal DBP Designs for 
Genome (8). 

Figure 5 shows that the Computer Program (2) in Figure 4 has two portions. The 
genomic data is first transformed by the Process Genome into Blocking Fragment Files 
function (2). These files are then used by the Design DBP's for a Genome function (3). 

The Process Genome into Blocking Fragment Files block (2) of Figure 5 is represented
in greater detail in Figure 6. For every nd from 11 down to 3 the Genome Descriptors file (12)
and the Genome DNA Sequence file (32) are read and transformed into the Unsorted
Fragment File (7). This same Unsorted Fragment File (14) is transformed by the Sort function
(13) provided by the computer manufacturer into the Sorted Fragment File (15). The same
Sorted Fragment File (30) is read and transformed eventually into the DBP-Size Blocking File
(22).

The Design DBP's for a Genome block (3) of Figure 5 is represented in greater detail
in Figure 7. The Genome Descriptors file (3), the Gene Features file (7), the Genome DNA
Sequence file (9) and the DBP-Size Blocking Files (37) corresponding to the nd's from 11
down to 3 are read and used to transform the genomic DNA first into genes and then into a
file of the Optimal DBP Designs for a Genome (38). The transformation and design process is
done for all the genes in a genome.

The "Determine if Current-Sub-Window is in Current-Blocking-File" block (22) in
Figure 7 is expanded in greater detail in Figure 8.

The "Calculate Binding-Energy-of-Blocking-Fragment" block (24) in Figure 7 is
expanded in greater detail in Figure 9.

By applying the algorithm to a variety of DBP's of varying nd, it was experimentally
determined that a value for nd of 9 is the best starting point in the algorithm, i.e., the process
should begin with the search for 9-finger DBP's. This can be better understood in terms of the
selection criterion, Rb, used in evaluating various DBP's. In short DBP's, e.g., ones wherein
nd = 4 or 5, Binding Energy block, which increases geometrically as the product of all Binding
Energy domain's, is significantly lower, and Binding Energy site values are relatively large.
However, as nd increases, the numerator of Rb increases dramatically, while, it has been
observed, the denominator, representing "background" or "noise," does not significantly
change. Thus, the case of nd = 9 provides assurance of high affinity and specificity of binding
without also bringing on the possibility of undue computational needs.
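
The geometric growth of the numerator can be made concrete with the column A extremes of 12 and 30 per domain. The sketch below is illustrative only; the function name `block_energy_bounds` is an assumption, and the bounds assume every finger follows a column A rule.

```python
MIN_DOMAIN_ENERGY = 12  # weakest column A rule: 2 H-bonds, 2 hydrophobic contacts
MAX_DOMAIN_ENERGY = 30  # strongest column A rule: 6 H-bonds


def block_energy_bounds(n_fingers: int) -> tuple[int, int]:
    """Smallest and largest possible product of n column A domain energies."""
    return MIN_DOMAIN_ENERGY ** n_fingers, MAX_DOMAIN_ENERGY ** n_fingers


bounds_4 = block_energy_bounds(4)
bounds_9 = block_energy_bounds(9)
```

Going from four to nine fingers raises the upper bound from under 10^6 to about 2 x 10^13, which is in line with the 10^11 to 10^13 binding units reported for acceptable 9-finger DBP's in the discussion of Figure 11.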

However, it should also be emphasized that the present invention is not limited to the
design of DBP's wherein nd = 9. For that matter, it will also be appreciated that, while nd = 9
has been found to be the best starting point, the best DBP for a given situation may turn out to
be one wherein nd < 9, the length of the target DNA sequence notwithstanding. The concept
of the invention can be applied to the design of DBP's of any length as required.
In initial computational experiments, a selectable sequence could have no 8, 7, 6, 5,
and 4-finger subsites; however, with the present system, only the sum of the subsite binding
energies need be minimized. As a result, it does not matter whether the subsite binding energy
comes from 3-finger subsites, 4-finger subsites or longer subsites. This
simple change from logical exclusion to energetic exclusion has been mandated not so much
by examination of the yeast genome, but more by examination of the worm genome.

The central portion of the instant algorithm is, in the case of finding an acceptable
nd-finger site (e.g., a 27-base segment for a 9-finger DBP), the search against all other nd-finger
sites in the entire genome to see if there are any similar sites. If such turns out to be the case,
the DBP with the highest Rb value is selected. Furthermore, the algorithm checks to see if
there are any equivalent 8-finger, 7-finger, 6-finger, 5-finger and 4-finger subsites in the whole
genome for a given 9-finger site. In the event no acceptable 9-finger site is found, the
algorithm then searches for a suitable 8-finger site. If necessary, the search is continued for a
7-finger site and so on, until an acceptable DBP binding site is found.

Within the search for a 9-finger DBP, the algorithm looks at all 27-base sequences,
which are called "frames." Each frame is evaluated to determine its interaction with DNA and
the interaction of all other subframes down to 3-finger subsites. The number of instances of
each frame and subframe in the genome has been recorded during the genome processing
phase of the execution of the software. The sequence of the frame or subframe is evaluated as
a product of the binding energy of each ZF. Each ZF domain recognizes three DNA bases.
The underlying DNA sequence that a ZF recognizes determines how many hydrogen bonds,
water contacts and hydrophobic contacts exist between the ZF and the DNA.

The way the algorithm detects whether a given n-base site occurs in other places in
the genome is by looking in a B-tree for the site. The whole genome is processed for each of
the n-finger sites. The algorithm contains means for sorting and merging the myriad
fragments and, in the end, there is produced an ordered list of all the blocking fragments for all
the different finger sizes.
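
Counting how often each site occurs in a genome can be sketched in Python. Note the swap: the text describes a B-tree and a sort-and-merge over fragment files, whereas this sketch uses an in-memory hash map (`collections.Counter`), which is an assumption chosen for brevity. The function name `count_sites` is hypothetical, and only one strand is scanned.

```python
from collections import Counter


def count_sites(genome: str, n_fingers: int) -> Counter:
    """Count every overlapping site of 3 * n_fingers bases in the genome."""
    width = 3 * n_fingers
    return Counter(
        genome[i:i + width] for i in range(len(genome) - width + 1)
    )


# Toy sequence, not real genomic data.
toy_counts = count_sites("ACGTACGTACGT", n_fingers=1)
```

A site whose count is greater than one occurs elsewhere in the sequence and would therefore act as a blocking fragment for a DBP aimed at it.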



Example 1 

The following is given as an example of how the search for, and design of, a DBP is
typically carried out. It involves screening for 9-finger DBP's (i.e., nd = 9) to bind to a target
DNA sequence of 100 nucleotides (i.e., N = 100). The sequence is screened, beginning with
position 1, for every 27-nucleotide sequence, i.e., 1-27, 2-28, 3-29 etc., in the entire
100-nucleotide sequence. Once this has been done, the 9-fingers are broken down into 3-finger
sections, i.e., 1-3, 2-4, 3-5 etc. The algorithm scans and looks for relative strengths of
binding. The idea is to maximize the ratio of DBP binding to subsite binding, Rb, thus
eliminating those 9-mers interacting with the greatest numbers of subsites.
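
The window enumeration in this example can be sketched as follows. The helper names `frames` and `finger_subsites` are illustrative assumptions; positions are 1-based and inclusive, as in the text.

```python
def frames(seq_len: int, n_fingers: int) -> list[tuple[int, int]]:
    """All (start, end) nucleotide windows of 3 * n_fingers bases."""
    width = 3 * n_fingers
    return [(s, s + width - 1) for s in range(1, seq_len - width + 2)]


def finger_subsites(n_fingers: int, sub_fingers: int) -> list[tuple[int, int]]:
    """All (first, last) runs of sub_fingers consecutive fingers."""
    return [
        (f, f + sub_fingers - 1)
        for f in range(1, n_fingers - sub_fingers + 2)
    ]


nine_finger_frames = frames(100, 9)
three_finger_sections = finger_subsites(9, 3)
```

For N = 100 and nd = 9 this gives 74 frames, from 1-27 through 74-100, and each 9-finger frame contains seven 3-finger sections, from fingers 1-3 through 7-9.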

The algorithm of the present invention was applied to the genomes of S. cerevisiae and 
C. elegans as illustrated by the following examples: 

Example 2 


The algorithm has been applied to the screening of the yeast genome. Two
chromosomes of yeast, containing 110 and 447 genes, respectively, have been processed. For
each gene the algorithm selected the nd-finger sequence with the lowest sum of subsite binding
energies. In yeast the number of 3-finger blocking fragments is almost maximal (i.e., 4 ,
versus 4 12 maximal). In the worm genome (see Example 3), the 3-finger blocking sequences
are absolutely maximal. In yeast the 4-finger blocking sequences are large in number but the
population of 5-finger blocking sequences is relatively small. In worm the 4-finger blocking
sequences are larger in number than the 5-finger blocking sequences, but the latter are larger
in number relative to yeast. In going in the future from worm to human, one can expect that
the 4-finger blocking sequences might come close to saturation (i.e., close to 4^12).

The algorithmic analysis was performed for 2 of the 16 chromosomes of yeast. The 
557 genes in the first two chromosomes seem to present a realistic picture of properties of all 
the chromosomes in the yeast genome. Sample calculations have been run on the whole yeast 
genome but these results are not different from those produced by calculating the properties of 
just two chromosomes' worth of genes. The results of the analysis of 100 yeast genes, typical 
of the findings throughout the analysis of the yeast genome, are presented in Table 4. 

The power of the algorithm is further demonstrated in the results displayed in Figures 
10-14. The figures display results obtained for all 557 genes of the two yeast chromosomes. 

Figure 11 shows that the binding energies (Binding Energy_block's) of the acceptable 9- 
finger DBP's are uniformly distributed between 10^11 and 10^13 binding units. 

Figure 12 shows that the distribution of the sum of the spurious subsite binding 
energies (Binding Energy_site's) is itself uniform in the range of 10^6 to [...] binding units. 

Figure 13 is a nonlogarithmic version of Figure 12. It shows that most of the 
acceptable 9-finger DBP's have spurious subsite binding energies of less than 5 x 10^6. 

Figure 14, produced by taking the ratios of the Figure 11 values to those of Figure 12, 
is a graph of the R_b's for the 9-finger DBP's. This chart shows that the ratio of the DBP 
binding strength of the acceptable 9-finger DBP's to the sum of the binding energies of the 
spurious subsite interactions varies from 10^4 to 10^6. 
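
The ratio plotted in Figure 14 can be sketched as a short calculation. The Python below is a hedged illustration: the function names, the candidate labels and all numerical values are hypothetical, and the threshold of 10,000 is the one recited in claims 3 and 10.

```python
def binding_ratio(block_energy, site_energies):
    """R_b: binding energy of the full DBP divided by the sum of the
    binding energies of all spurious subsite interactions."""
    total = sum(site_energies)
    if total == 0:
        return float("inf")   # no spurious sites found at all
    return block_energy / total

def rank_candidates(candidates):
    """candidates maps a label to (block_energy, [site energies]).
    Return (R_b, label) pairs ordered from highest to lowest R_b."""
    scored = [(binding_ratio(be, sites), label)
              for label, (be, sites) in candidates.items()]
    return sorted(scored, reverse=True)

# Hypothetical values, chosen only to fall in the ranges quoted above.
candidates = {
    "frame_1": (10**12, [2 * 10**6, 3 * 10**6]),
    "frame_2": (5 * 10**11, [4 * 10**7]),
}
ranked = rank_candidates(candidates)
acceptable = [label for r_b, label in ranked if r_b >= 10_000]
print(ranked, acceptable)
```

Ranking by R_b and then applying a cutoff corresponds to the rank-ordering and selection steps of the claimed method.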

The analytical tools of the present invention were also employed in the further analysis 
of a single yeast gene, YAR073, in particular the 300-bp region of the promoter immediately 
upstream of the coding region. The full sums of the subsite binding energies (SBE's) for each 
27-base frame in this portion of the gene were determined; the results are depicted graphically 
in Figure 15. The primary binding energies (BE's) were also determined, and a correlation 
was found between the SBE values and the values of the ratios of BE:SBE (R_b). Still further 
(Figure 16), it was seen that the peaks of the plot of the R_b values correspond to the footprints 
of the transcription factors of the same gene (determined in a separate study). 

Example 3 

Application of the algorithm according to the instant invention to 100 genes in C. 
elegans showed that the system can be applied as successfully to C. elegans as to S. 
cerevisiae. The results of analysis of the 100 C. elegans genes are presented in Table 5. 

In Figure 17, it can be seen that, for one of the analyzed C. elegans genes, only a 
5-finger DBP could be designed. For another gene, only a 7-finger DBP could be designed. 
These two genes, 2 and 32, are not seen in Table 5, since it presents results of the analysis 
only for those genes (98 out of 100) for which a 9-mer could be designed. In any event, the 
results depicted in Figure 17 are in keeping with the expectation for analysis of the entire C. 
elegans genome, namely, that the distribution of 5- through 9-finger DBP's is somewhat 
different than in S. cerevisiae. 

Figure 18 represents the same analysis for the C. elegans genes as was depicted in 
Figure 14 for S. cerevisiae genes. Figure 18 shows a similar R_b value distribution to that seen 
in Figure 14. 

Examples 2 and 3 demonstrate the applicability of the instant invention to the design of 
DBP's for the genomes of two widely disparate organisms. The various results of the 
application of the algorithm to the yeast genome, in particular, and also to the worm genome, 
show the power of the algorithmic tool and demonstrate its foundation in reality, i.e., that it 
does not merely provide a random and/or theoretical analysis. It is to be expected, on the 
basis of these analyses, that the inventive algorithm can be extended to the design of DBP's 
for any desired segment of the genome of any organism of interest, including that of a human. 

Although the instant algorithm involves a search against the entire genome of an 
organism, the results of the present studies strongly indicate that lack of complete knowledge 
of the genome of a given organism would not constitute an impediment to application of the 
present invention to the design of DBP's for that organism. One would expect to be able to 
use the knowledge of block sequences obtained in the studies presented herein on S. cerevisiae 
(a unicellular organism) and C. elegans (a multicellular organism) to form valid estimates of 
allowable sequences for the systems of higher eukaryotes. 

For example, the present studies on yeast and worm indicate that the genomic "noise," 
in this context the spurious binding site energies, is relatively constant, and this can be 
projected to higher, more complex organisms as well. In other words, one would expect from 
the demonstrated combinatorics of DNA sequences to be able to extrapolate, or extend, the 
present algorithm to the analysis of more complex genomes, however much is known of the 
specific sequences therein, with the object of designing effective DBP's. Furthermore, as the 
entire genomes of larger organisms, e.g., D. melanogaster, become known, they will provide 
further keys to the analysis of the genomes of higher organisms, including humans. 

A DBP as specified above may be built by using standard protein synthesis techniques; 
or, employing the standard genetic code, its amino acid sequence may be used as the basis for 
specifying and constructing a gene whose expression product is the DBP. 
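
As an illustration of the second route, a DBP amino acid sequence can be back-translated into a coding sequence with the standard genetic code. The sketch below is an assumption-laden example: it picks one arbitrary codon per amino acid from the standard code, appends a stop codon, and ignores host codon usage; the function name and the codon choices are hypothetical and are not prescribed by the specification.

```python
# One (arbitrarily chosen) standard-code codon per amino acid.
CODON_FOR = {
    "A": "GCT", "C": "TGT", "D": "GAT", "E": "GAA", "F": "TTT",
    "G": "GGT", "H": "CAT", "I": "ATT", "K": "AAA", "L": "CTG",
    "M": "ATG", "N": "AAT", "P": "CCG", "Q": "CAG", "R": "CGT",
    "S": "TCT", "T": "ACC", "V": "GTG", "W": "TGG", "Y": "TAT",
}

def back_translate(protein, stop_codon="TAA"):
    """Return a DNA coding sequence for `protein` (one-letter codes),
    followed by a stop codon. Raises KeyError on unknown residues."""
    return "".join(CODON_FOR[aa] for aa in protein.upper()) + stop_codon

# A short hypothetical peptide, used only to show the mapping.
print(back_translate("MCH"))
```

Because the genetic code is degenerate, many different genes encode the same DBP; a practical design would choose among synonymous codons according to the expression host.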

Proteins so designed can be used in any application [...] 

Similarly, in instances in which the target DNA sequence is a promoter, one can 
produce a promoter-specific DBP which, when bound, will act to alter (i.e., enhance, attenuate 
or even terminate) expression of a given gene or, alternatively, genes under control of that 
promoter. 

As another application, a DBP could be designed to bind specific DNA sequences 
when attached to solid supports. Such solid supports could include styrene beads, acrylamide 
well-plates or glass substrates. 

In order to realize the specific applications mentioned above, as well as the full scope 
of applications possible through the instant invention, the DBP can be designed as set forth 
above to include the added feature of a pre- and/or postdomain amino acid sequence of 
arbitrary length. This would include, for example, the coupling of the basic DBP to an 
endonuclease or to a reporter or to a sequence by which the DBP could be coupled to a solid 
support. 

Accordingly, the instant invention includes DBP's that bind to a predetermined target 
double-stranded DNA sequence of 3n (where n>1) base pairs in length of the form: 

NH_2 - X_0-m - ZiF_1 - [{linker} - ZiF_2] ... - [{linker} - ZiF_n] - X_0-p - COOH 

wherein each ZiF_1 to ZiF_n is a ZF domain of the form set forth above; {linker} is an amino acid 
sequence as set forth above; X_0-m stands for a sequence of from 0 to m amino acids and X_0-p 
stands for a sequence of from 0 to p amino acids. The values for m and p and the identities of 
the amino acids are determined by the particular protein(s) or amino acid sequence(s) to be 
coupled to the DBP for a given application. 
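
The general form above can be illustrated by a small assembly routine. The Python sketch below encodes the ZF-domain and linker formulas recited in the claims as regular expressions, treating X as any single character, and joins validated domains with a linker between optional pre- and post-domain sequences. The function names are hypothetical, and the example domain is simply an illustrative sequence chosen to satisfy the pattern.

```python
import re

# A_1 X C X_2-4 C A_2 A_3 X F X Z_3 X X Z_2 L X Z_1 H X_3-5 H
ZF_DOMAIN = re.compile(r"[FY].C.{2,4}C[GD][KR].F.(.)..(.)L.(.)H.{3,5}H")
# A_4 A_5 X_0-2 E A_6 P
LINKER = re.compile(r"[TS][GE].{0,2}E[KR]P")

def base_contacts(domain):
    """Return (Z_1, Z_2, Z_3) for a domain that fits the formula."""
    match = ZF_DOMAIN.fullmatch(domain)
    if match is None:
        raise ValueError("not a ZF domain of the required form")
    z3, z2, z1 = match.groups()   # Z_3 comes first in the sequence
    return z1, z2, z3

def assemble_dbp(domains, linker, prefix="", suffix=""):
    """Build prefix - ZiF_1 - [linker - ZiF_2] ... [linker - ZiF_n] - suffix."""
    if LINKER.fullmatch(linker) is None:
        raise ValueError("linker does not fit the required form")
    for domain in domains:
        base_contacts(domain)     # validates each domain
    return prefix + linker.join(domains) + suffix

finger = "YACPVESCDRRFSRSDELTRHIRIH"   # illustrative domain only
print(base_contacts(finger))
print(assemble_dbp([finger] * 3, "TGEKP", prefix="M"))
```

The prefix and suffix arguments play the role of X_0-m and X_0-p, for example a sequence by which the DBP is coupled to an endonuclease, a reporter or a solid support.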

In a further embodiment of the invention, the Zn^+2 atom, which forms a complex with 
the two cysteine and two histidine amino acids in a specific ZF motif, can be substituted by a 
Co^+2 or a Cd^+2 atom, thus making a "cobalt finger" or a "cadmium finger." 

The rules presented in Table 2 ("rule set A") are to be regarded as the "first choice" 
rules for optimal combinations in ZF-DNA recognition. However, it should be emphasized, as 
indicated in column B ("rule set B") of Table 1 or Table 3, that there are many alternative AA 
combinations that would also be expected to be important in the design of DNA-binding 
proteins capable of forming useful ZF-DNA complexes. 

What is claimed is: 



PCT/US99/03692 



1. A method for designing a DBP, with multiple ZF domains connected by linker sequences, 
that binds selectively to a target DNA sequence within a given gene, each of said ZF 
domains having the formula 

A_1 X C X_2-4 C A_2 A_3 X F X Z_3 X X Z_2 L X Z_1 H X_3-5 H 

and each of said linkers having the formula 

A_4 A_5 X_0-2 E A_6 P, 

wherein 

(i) X is any amino acid; (ii) X_2-4 is a peptide from 2 to 4 amino acids in length; (iii) X_3-5 is a 
peptide from 3 to 5 amino acids in length; (iv) X_0-2 is a peptide from 0 to 2 amino acids in 
length; (iv) A_1 is selected from the group consisting of phenylalanine and tyrosine; (v) A_2 is 
selected from the group consisting of glycine and aspartic acid; (vi) A_3 is selected from the 
group consisting of lysine and arginine; (vii) A_4 is selected from the group consisting of 
threonine and serine; (viii) A_5 is selected from the group consisting of glycine and glutamic 
acid; (ix) A_6 is selected from the group consisting of lysine and arginine; (x) C is cysteine; 
(xi) F is phenylalanine; (xii) L is leucine; (xiii) H is histidine; (xiv) E is glutamic acid; (xv) 
P is proline; and (xvi) Z_1, Z_2 and Z_3 are the base-contacting amino acids, which method 
comprises an algorithm comprising the steps of: 

(a) setting a genome to be screened; 

(b) selecting the target DNA sequence in the genome for binding; 

(c) setting the number of ZF domains to n_d; 

(d) dividing the target DNA sequence into nucleotide blocks wherein each block 
contains n_z nucleotides using a first routine where n_z is determined using the 
following relationship: 

n_z = 3n_d; 

(e) assigning base-contacting amino acids at Z_1, Z_2 and Z_3 to each ZF domain, 
according to the A Rules and/or B Rules set forth in Tables 1-3 of the specification, of a DBP 
which binds to the first nucleotide block from step (d) as numbered from the first 5' nucleotide 
of the target gene sequence to generate a block-specific DBP and calculating the binding 
energy, Binding Energy_block, of each ZF domain of each such block-specific DBP as the 
product of the binding energies, Binding Energy_domain, of all ZF domains of the DBP, each 
determined using the formula: 

Binding Energy_domain = (5 x the number of hydrogen bonds) + (2 x the 
number of H2O contacts) + (the number of hydrophobic contacts); 

(f) subdividing the DBP from step (d) into blocks using a second routine to generate a 
subdivided DBP having three ZF domains; 

(g) screening the subdivided DBP from step (f) against the genome using a third 
routine to determine the number of binding sites in the genome for each subdivided 
DBP in the genome and assigning a binding energy for each such site using the 
following formula: 

Binding Energy_site n = (5 x the number of hydrogen bonds) + (2 x 
the number of H2O contacts) + (the number of hydrophobic contacts); 

(h) calculating a ratio of binding energy, R_b, using a fourth routine for each nucleotide 
block-specific DBP from step (e) using the following formula: 

R_b = Binding Energy_block / the sum of all Binding Energy_site n's for all 
subdivided DBP's from step (g); 

(i) repeating steps (f) through (h) for each subdivided DBP wherein n_d > 4; 

(j) repeating steps (d) through (i) for each nucleotide block in the target DNA 
sequence containing n_z nucleotides; 

(k) rank-ordering R_b numerical values obtained from step (h); and 

(l) selecting a DBP with an acceptable R_b value. 

2. The method of claim 1 wherein the DBP selected is that whose R_b numerical value is 
the highest numerical value for all DBP's in step (h) that bind to the target DNA 
sequence. 



3. The method of claim 1 wherein the DBP R_b numerical value determined in step (h) is at 
least 10,000. 

4. The method of claim 1 wherein the number of ZF domains, n_d, 
is nine. 

5. The method of claim 1 wherein the rules for assigning base-contacting amino acids at 
Z_1, Z_2 and Z_3 for each nucleotide block in step (e) are selected from rule set A. 

6. The method of claim 1 wherein the rules for assigning base-contacting amino acids at 
Z_1, Z_2 and Z_3 for each nucleotide block in step (e) are selected from rule set B. 

7. The method of claim 1 wherein rules for assigning base-contacting amino acids at Z_1, 
Z_2 and Z_3 for each nucleotide block in step (e) are a combination selected from rule sets 
A and B. 

8. A computer system for designing a DBP, with multiple ZF domains connected by 
linker sequences, that binds selectively to a target DNA sequence within a given gene, each of 
said ZF domains having the formula 

A_1 X C X_2-4 C A_2 A_3 X F X Z_3 X X Z_2 L X Z_1 H X_3-5 H 

and each of said linkers having the formula 

A_4 A_5 X_0-2 E A_6 P, 

wherein 

(i) X is any amino acid; (ii) X_2-4 is a peptide from 2 to 4 amino acids in length; (iii) X_3-5 is a 
peptide from 3 to 5 amino acids in length; (iv) X_0-2 is a peptide from 0 to 2 amino acids in 
length; (iv) A_1 is selected from the group consisting of phenylalanine and tyrosine; (v) A_2 is 
selected from the group consisting of glycine and aspartic acid; (vi) A_3 is selected from the 
group consisting of lysine and arginine; (vii) A_4 is selected from the group consisting of 
threonine and serine; (viii) A_5 is selected from the group consisting of glycine and glutamic 
acid; (ix) A_6 is selected from the group consisting of lysine and arginine; (x) C is cysteine; 
(xi) F is phenylalanine; (xii) L is leucine; (xiii) H is histidine; (xiv) E is glutamic acid; (xv) P 
is proline; and (xvi) Z_1, Z_2 and Z_3 are the base-contacting amino acids, which computer 
system comprises means for design which include an algorithm comprising the steps of: 

(a) setting a genome to be screened; 

(b) selecting the target DNA sequence in the genome for binding; 

(c) setting the number of ZF domains to n_d; 

(d) dividing the target DNA sequence into nucleotide blocks wherein each block 
contains n_z nucleotides using a first routine where n_z is determined using the 
following relationship: 

n_z = 3n_d; 

(e) assigning base-contacting amino acids at Z_1, Z_2 and Z_3 to each ZF domain, 
according to the A Rules and/or B Rules set forth in Tables 1-3 of the specification, of a DBP 
which binds to the first nucleotide block from step (d) as numbered from the first 5' nucleotide 
of the target gene sequence to generate a block-specific DBP and calculating the binding 
energy, Binding Energy_block, of each ZF domain of each such block-specific DBP as the 
product of the binding energies, Binding Energy_domain, of all ZF domains of the DBP, using the 
formula: 

Binding Energy_domain = (5 x the number of hydrogen bonds) + (2 x the 
number of H2O contacts) + (the number of hydrophobic contacts); 

(f) subdividing the DBP from step (d) into blocks using a second routine to generate a 
subdivided DBP having three ZF domains; 

(g) screening the subdivided DBP from step (f) against the genome using a third 
routine to determine the number of binding sites in the genome for each subdivided 
DBP in the genome and assigning a binding energy for each such site using the 
following formula: 

Binding Energy_site n = (5 x the number of hydrogen bonds) + (2 x 
the number of H2O contacts) + (the number of hydrophobic contacts); 

(h) calculating a ratio of binding energy, R_b, using a fourth routine for each nucleotide 
block-specific DBP from step (e) using the following formula: 

R_b = Binding Energy_block / the sum of all Binding Energy_site n's for all subdivided 
DBP's from step (g); 

(i) repeating steps (f) through (h) for each subdivided DBP wherein n_d > 4; 

(j) repeating steps (d) through (i) for each nucleotide block in the target DNA 
sequence containing n_z nucleotides; 

(k) rank-ordering R_b numerical values obtained from step (h); and 

(l) selecting a DBP with an acceptable R_b value. 

9. The computer system according to claim 8 wherein the DBP selected is that whose R_b 
numerical value is the highest numerical value for all DBP's in step (h) that bind to the target 
DNA sequence. 

10. The computer system according to claim 8 wherein the DBP R_b numerical value 
determined in step (h) is at least 10,000. 

11. The computer system according to claim 8 wherein the number of ZF domains, n_d, is 
nine. 

12. The computer system according to claim 8 wherein the rules for assigning 
base-contacting amino acids at Z_1, Z_2 and Z_3 for each nucleotide block in step (e) are selected from 
rule set A. 

13. The computer system according to claim 8 wherein the rules for assigning 
base-contacting amino acids at Z_1, Z_2 and Z_3 for each nucleotide block in step (e) are selected from 
rule set B. 

14. The computer system according to claim 8 wherein the rules for assigning 
base-contacting amino acids at Z_1, Z_2 and Z_3 for each nucleotide block in step (e) are a combination 
selected from rule sets A and B. 